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Abstract 

Background: The C2H2 zinc-finger (ZNF) containing gene family is one of the largest and most complex gene 
families in metazoan genomes. These genes are known to exist in almost all eukaryotes, and they constitute a 
major subset of eukaryotic transcription factors. The genes of this family usually occur as clusters in genomes and 
are thought to have undergone a massive expansion in vertebrates by multiple tandem duplication events 
(BMC Evol Biol 8:176, 2008). 

Results: In this study, we combined two popular approaches for homolog detection, Reciprocal Best Hit (RBH) 
(Proc Natl Acad Sci USA 95:6239-6244, 1998) and Hidden-Markov model (HMM) profiles search (Bioinformatics 
14:755-763, 1998), on a diverse set of complete genomes of 124 eukaryotic species ranging from excavates to 
humans to identify all detectable members of 37 C2H2 ZNF gene families. We succeeded in identifying 3,890 genes 
as distinct members of 37 C2H2 gene families. These 37 families are distributed among the eukaryotes as 
progressive additions of gene blocks with increasing complexity of the organisms. The first block featuring the 
protists had 7 families, the second block featuring plants had 2 families, the third block featuring the fungi had 
2 families (one of which was also present in plants) and the final block consisted of metazoans with 25 families. 
Among the metazoans, the simpler unicellular metazoans had just 15 of the 25 families while most of the 
bilaterians had all 25 families making up a total of 37 families. Multiple potential examples of lineage-specific gene 
duplications and gene losses were also observed. 

Conclusions: Our hybrid approach combines features of the both RBH and HMM methods for homolog detection. 
This largely automated technique is much faster than manual methods and is able to detect homologs accurately 
and efficiently among a diverse set of organisms. Our analysis of the 37 evolutionarily conserved C2H2 ZNF gene 
families revealed a stepwise appearance of ZNF families, agreeing well with the phylogenetic relationship of the 
organisms compared and their presumed stepwise increase in complexity (Science 300:1694, 2003). 
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Background 

The morphological complexity of organisms can be, to a 
certain extent, assigned to the transcription factors that 
control expression of various genes such as those that 
control signal transduction, cell growth, differentiation, 
and development [1]. One such family of transcription 
factors is the Zinc Finger (ZNF) protein family, which is 
the largest family of DNA-binding transcription factors 
in eukaryotes. Of these ZNF proteins, the C2H2 type of 
zinc finger proteins remains the largest group [2], This 
group is characterized by zinc finger domains, consisting 

* Correspondence: aseetharam@indstate.edu 

'Department of Biology, Indiana State University, Terre Haute, IN 47809, USA 
2 Bioinformatics Core, Purdue University, West Lafayette, IN 47906, USA 

Bio Med Central 



of 20-30 amino acid residues with a zinc ion coordinated 
by 2 cysteine and 2 histidine residues. C2H2 ZNF pro- 
teins often contain more than one such finger as tandem 
repeats. These proteins are known to exist in prokary- 
otes and eukaryotes and are most common in mammals. 
It is estimated that more than 700 C2H2 ZNF genes 
exist in humans accounting for more than 2 per cent of 
the total human genes [3]. Most of these C2H2 ZNF 
proteins act by binding DNA duplexes using their zinc 
finger motifs and are involved in controlling expression 
of their target genes. Some C2H2 ZNF proteins also play 
roles as either subunits of transcription proteins, splicing 
factors, or DNA damage repair proteins [4]. It is reason- 
able to assume that as morphologically simpler organisms 
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evolved increasing numbers of genes, they must also have 
developed new control genes, including additional ZNF 
genes, to evolve into more complex organisms. 

With the advent of "next generation" sequencing 
methods and the explosive growth of sequence data- 
bases, faster and more reliable methods for identification 
of gene family members, including the C2H2 ZNF genes, 
are of great interest. The study of the evolution of the 
C2H2 ZNF genes in various genomes may help to eluci- 
date their possible role in the functions associated with 
speciation. Homolog prediction is one of the most vital 
steps in the functional annotation of genomes. The cor- 
rect identification of homologs and putative orthologs 
greatly facilitates the accuracy of downstream analysis 
such as phylogenetic tree construction, protein structure 
prediction, prediction of protein-protein interaction, and 
species classification [5]. An effective and commonly used 
method of homolog/ortholog prediction is Reciprocal-best- 
BL AST-hits (RBH) [6,7], where genes from two species 
are homologs and potential orthologs if they are both best 
BLAST hits when the gene from one genome is used to 
search the other genome. Although RBH is an effective 
procedure, potential homologs in multi-member families 
might be missed due to the restricted amount of informa- 
tion about the gene family in question that is present in 
just two sequences. More sophisticated methods based on 
Hidden-Markov models (HMM) [8] can also be applied 
and are easily automated for homolog detection [5,9]. In 
the HMM method, each family is typically described by 
one or more information-rich HMM profiles that can be 
used to efficiently scan entire genomes for matches. This 
approach in general is very sensitive in detecting homologs 
and can be applied for large-scale, genome-level detection 
[5]. Homolog prediction is especially difficult when mul- 
tiple related gene families are considered, as exemplified 
by the many diverse C2H2 ZNF gene families [2]. The 
high baseline of similarity among the families and subfam- 
ilies of C2H2 ZNF genes, along with their large numbers 
makes automated detection and assignment of C2H2 ZNF 
genes a challenge [2]. Many previous studies have success- 
fully used these methods to uncover a large number of 
C2H2 ZNF gene families. The most prominent of these 
provide a comprehensive cataloging of human KRAB- 
associated ZNF genes that were conserved in mouse, dog 
and chimpanzee [10-12], a description of the SysZNF 
database for all the C2H2 ZNF genes of human and 
mouse [13] and a study on Zinc Finger Associated Do- 
main (ZAD) type C2H2 ZNF gene families in arthropods 
[14]. All these methods either used HMM profiles gener- 
ated from the C2H2 ZNF motif or the pfam domain 
(PF00096) to scan proteomes and identify putative C2H2 
ZNF genes as the first step. Identified genes were then val- 
idated using BLAST or other related methods. In our ap- 
proach for gene homolog detection, we combined both 



RBH and HMM methods in a similar way. But instead of 
using the C2H2 domain for scanning, we used HMM pro- 
files generated from the C2H2 ZNF gene families for the 
initial scanning step. The method is analogous to the 
existing method of orthology detection in expressed se- 
quence tags (EST) called HaMStR (Hidden Markov Model 
based Search for Orthologs using Reciprocity) [15]. Like 
our method HaMStR also uses the forward Hidden 
Markov Model and reverse BLAST search to extend 
existing ortholog clusters with sequences from additional 
taxa. However, unlike HaMStR, which used a large num- 
ber of core orthologs as the reference set, our method only 
used a targeted set of ortholog families that were manually 
identified from 4 different species proteomes. 

To understand the complex evolution of these C2H2 
ZNF gene families, we undertook a survey to identify the 
different members of C2H2 ZNF family genes from all 
the eukaryotes that represent different taxa in the Tree 
of Life. We used the previously identified C2H2 ZNF 
genes of bilaterians (Daphnia, human, worm and fly) as 
our starting point for analyses [4]. These families were 
originally identified as conserved C2H2 zinc-finger gene 
families in bilaterians and were classified based on se- 
quence identity at defined sites [2]. We used the con- 
served gene families of bilaterians to scan other domains 
of the Tree of Life because we assumed that shared an- 
cestry of these families in bilaterians could be extended 
to the lower eukaryotic domain, and that this might 
serve to identify the approximate point of origin of these 
gene families within the phylogeny. Also, the availability 
of well annotated genomes for bilaterians provided high 
confidence in the generation of models and validation of 
identified genes. For the present study, we developed 
and utilized a large subset of partially edited and aug- 
mented HMM profiles representing 37 C2H2 ZNF gene 
families within the bilaterian organisms and then used 
these profiles to predict gene family memberships from an 
extensive variety of 124 completely sequenced eukaryotic 
species. 

Results and discussion 

The hybrid method developed for homolog detection is 
largely automated, rapid, and efficient for identifying 
members of C2H2 ZNF genes. This method utilizes 
HMM profiles of the gene families for initial sensitive 
detection of putative homologs from a variety of ge- 
nomes and then validates these putative homologs using 
a focused BLAST search of a restricted set of well anno- 
tated genomes and comparison to a master list of known 
homologs. This method is logically extensible to any 
number of gene families represented by an HMM, and 
any number of complete genomes (and predicted 
proteomes) available for analysis. The NCBI refseq data- 
base and Swiss-prot provided an excellent resource for 
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the C2H2 ZNF protein sequences used to generate 
HMM profiles after alignment of reference sequences. 
Since the entire analysis was dependent on the HMM 
profile, the quality of the profiles used is crucial. Care 
was taken to choose only families that have representa- 
tive profiles. A total of 37 HMM profiles were generated 
for C2H2 ZNF gene families based on the existing infor- 
mation on those families. 

During the first round of ortholog detection, 124 pro- 
tein datasets belonging to various eukaryotic groups 
were scanned using 37 C2H2 ZNF HMM profiles separ- 
ately. The output obtained consisted of potential homo- 
logs recognized for each profile within each genome. In 
the next step, a focused local BLAST used these poten- 
tial profile-derived homologs individually as queries 
against a set of well annotated, reference genomes. The 
BLAST outputs generated were then scanned for the 
presence of "master list" genes as the top hits in order to 
decide unambiguous membership in the gene families 
represented by the HMMs used. The sequences found 
this way were used to further refine the HMM profiles 
to increase the specificity, and two more rounds of this 
process were performed. The final list of presumed gene 
family members was catalogued in a spreadsheet. 

The final output with 37 HMM profiles on 124 eukary- 
otic genomes identified 3,890 members of a relatively 
complex subset of C2H2 ZNF gene families. All identified 



C2H2 ZNF genes and their numbers across the tree of life 
are provided as spreadsheet in Additional files 1 and 2 re- 
spectively. The profiles generated in this study and the se- 
quences identified are also provided as Additional file 3 
and 4. Although initial HMM profiles were biased with 
more bilaterian sequences, subsequent scans employed 
separate HMM profiles for various eukaryotic groups de- 
rived from the sequences belonging to the respective 
groups. In the present study, 124 genomes were classified 
as 4 different groups. The first group included 30 species 
of protists belonging to excavates (including phyla 
Parabasilia, Fornicate and Euglenozoa), Chromalveolates 
(including phyla Apicomplexa, Ciliophora, Rhizaria, 
Heterokontophyta and Cryptophyta) and Amebozoa. The 
second group consisted of 16 different plant species be- 
longing to Cyanidiophyta, Chlorophyta and Streptophyta. 
The third group had 28 species of fungi with phyla 
Basidiomycota, Ascomycota and Microsporidia. The last 
and the largest group consisted of metazoans with 50 spe- 
cies consisting of Choanozoa, Placozoa, Porifera, Cnidar- 
ian, Nematoda, Annelida, Arthropoda, Echinodermata, 
Tunicata, Cephalochordata and chordate (Additional file 5 
and 6). Heterogeneous representation of various groups 
was mainly due to either a lack of genome sequence or 
non-availability of the proteome datasets. Despite the 
breadth of the organisms scanned, the results (Figure 1) in- 
dicate a clear pattern of gene block conservation within 
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Figure 1 Summarized representation of the distribution of 37 families among various groups of eukaryotes. Numbers in the parentheses 
indicate the number of species under each phylum. Colors indicate various taxonomical groups (Light blue: Protists, Green: Plants, Dark blue: 
Amoebozoans, Purple: Fungus; Yellow: Lower metazoans and Red: Bilaterians). Family names are based on standard bilaterian family names as 
used in our previous publication [4]. 
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closely related organisms as well as a reasonable progres- 
sion of gene family additions that correlates well with a 
presumed increase in organismal complexity. This nearly 
uniform block pattern was occasionally disrupted by the 
presence of "holes" within the blocks (perhaps representing 
a lineage or organism specific gene loss) and the presence 
of "loner" genes (genes that appear to be absent from al- 
most all other closely related organisms). The latter may 
represent putative horizontal gene transfer events. 

Gene families present in all eukaryotes 

Among the 37 C2H2 ZNF gene families, seven families 
(SF3A2, SF3A3, KIN17, ZFP598, ZFP622, ZMAT2 and 
ZNF207) appear to be present in almost all eukaryotes. 
Some exceptions include Discoba, which lacked families 
ZMAT2 and ZNF207, Rizaria, which lacked the ZNF207 
family and microsporidia, which lacked ZMAT2, 
ZNF207, ZF622 and KIN17 families. A phylum/class 
represented by multiple species would be considered to 
have a particular family even if one organism belonging 
to the phylum lacked that gene family. Missing family 
members in some species could merely represent the ab- 
sence of gene models from the data set due to error, in- 
complete sequencing coverage, or incorrect gene model 
prediction. 

All 7 of these gene families have just one homolog in 
almost all the species scanned. SF3A3 (Splicosome factor 
3a subunit 3), SF3A2 (Splicosome factor 3a subunit 2), 
Kinl7, and ZMAT2 (Zinc finger Matrin Type 2) all en- 
code single highly conserved Ul-like C2H2 zinc finger 
domain. ZNF598 (Zinc Finger 598) has five C2H2 zinc 
finger domains, ZNF622 (Zinc finger 622) has four 
C2H2 zinc finger domains and ZNF207 has 2 C2H2 type 
zinc finger domains. SF3A3 and SF3A2 are known to act 
as subunits for RNA splicing machinery [16-18], Kinl7 
is believed to be involved in the cellular response to 
DNA damage, gene expression, and DNA replication, 
and ZNF622 is known to be involved in early T cell acti- 
vation and embryonic development in mouse. The exact 
functions of the other gene families (ZNF207, ZNF598 
and ZMAT2) are not clearly understood. 

The gene families added in plants and amoebozoans 

The next expansion of gene families occurred in plants 
with an addition of the 2 families TFIIIA and YY1. Al- 
though lower plants belonging to phylum Chlorophyta 
(green algae) lacked these families, both families were 
present in all higher plants belonging to phylum 
Streptophyta. These families were represented as single 
homologs in most of the species, except YY1 which had 
2 homologs in class Lillopsida. In addition to TFIIIA 
and YY1, Amoebozoa also had two more families, 
ZNF277 and ZIC. Though these families were not 



present in any other closely related groups (fungi or 
plants), they were present in lower metazoans. 

TFIIIA (Transcription factor III A), with 9 zinc-finger 
domains, is a DNA-binding transcription factor known 
to bind RNA and required for 5sRNA gene expression in 
metazoans [4]. YY1 (Yin Yang 1) generally has 4 zinc- 
finger domains and has multiple functions, both as re- 
pressor and as an activator of gene expression [19]. In 
metazoans, they play roles in induction and patterning 
of the embryonic nervous system, differentiation within 
blood cell lineages, cell-cycle control, cell proliferation, 
differentiation, apoptosis, DNA synthesis and packaging, 
and X-inactivation [19]. The exact role of both these 
families in plants is not well understood. 

Gene family additions in fungi 

Expansion of gene families in fungus included the 
addition of 2 families (TFIIIA and GLI) to the original 7 
families present in all the eukaryotes. Of the 2 families, 
TFIIIA was also present in plants, while GLI was not. 
TFIIIA has just one homolog in all the fungus species, 
as is true for plants and other eukaryotes. Although GLI 
(Glioma-associated oncogene) occurs as a multi-gene 
family in most metazoans, it has one homolog in all fun- 
gus species. In Humans, the GLI family is known to 
regulate various aspects of early development of the cen- 
tral nervous system. 

Gene family additions in metazoans 

The final massive expansion of C2H2 ZNF gene families 
occurred in metazoans with the addition of the re- 
maining 25 gene families (Figure 1). Lower metazoans 
including Choanozoa, Porifera, Cnideria, and Placozoa 
only have a partial representation of these 25 families. 
Choanozoa, considered to be the most basal among the 
metazoans [20], have just 4 families added (KLF, JAZ, 
BLIMP and EVI1). They also lacked the families that 
were added in plants and fungi (GLI, TFIIA and YY1). 
The Porifera phylum, has an additional 4 families (SP, 
EGR, ZXD and MTF1), but compared to Choanozoa, 
they share just one family (KLF) and lack 3 families 
(JAZ, BLIMP and EVI1). Cniderians have all the families 
present in Choanozoa and Porifera except BLIMP and 
EVIL They also have an additional 6 families (SNAIL, 
OVO, GFI, IA1, FEZ and PRDM13) not present in 
Choanozoa and Porifera. Placozoa have all the families 
present in cniderians except MTF1, GLI and TFIIIA. 
Most of the bilaterians have almost all the 25 families 
represented except for a few phyla/classes that lack one 
or more families. The prominent phyla/classes lacking 
the largest number of families are nematodes (lacking 
TFIIIA, ZXD, MTF1, JAZ, PRDM13, and CTCF) and 
urochordates (lacking JAZ, OVO, FEZ, PRDM13, ZEP 
and OAZ). Our observations on a large number of 
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family losses in nematodes was consistent with a previ- 
ous study [2,4]. Also, previously observed massive gene 
losses during the rapid evolution and adaptation of uro- 
chordates to a specialized environment could be the rea- 
son for the missing C2H2 ZNF gene families in 
urochordates [21-23]. Other phyla/ classes lacking one or 
more families include arachnids (lacking ZNF 384), 
some insects (lacking ZXD), echinoderms (lacking 
DISCO), cephalochordates (lacking RREB, ZNF 384/ 
362), neoavians (lacking IA1) and monotremes (lacking 
ZNF 384/362). The complete list of bilaterian specific 
zinc finger families are ZNF384/ZNP362, ZEP, DISCO 
(Disconnected), RREB (RAS responsive element binding 
protein), OAZ (Smad- and Olf-interacting zinc finger 
protein), CTCF, OSR (odd-skipped-related), SPALT, 
ZFHX1 and ZEB. 

Conclusions 

Our approach combined features of both RBH and 
HMM methods of homolog detection. This technique is 
much faster than manual methods and is able to detect 
homologs accurately when compared to RBH alone [4]. 
Furthermore, this method can be easily applied to new 
gene families that can be represented by an HMM, and 
to any number of completed genomes (and predicted 
proteomes) available for analysis. A total of 3,890 genes 
was identified from 124 completely sequenced eukaryotic 
genomes that belong to 37 members of a relatively com- 
plex subset of C2H2 ZNF gene families. These gene fam- 
ilies in eukaryotes revealed a stepwise evolutionary 
process of gene block additions, which agrees well with 
the phylogenetic relationship of the organisms [20], as well 
as a presumed increase in organismal complexity. 

Out of the 37 total families, 7 families are present in 
all eukaryotes. The increased morphological complexity 
from primitive protists to plants or fungi involved 
addition of two families, with one family common to 
both fungus and plants. The final expansion in meta- 
zoans added 25 families to those present in other groups 
(protists, plants and fungi) and this expansion correlates 
with the large increase in morphological complexity of 
these organisms. Although choosing bilaterian conserved 
gene families to scan the other eukaryotes made this 
study biased towards bilaterians, it also allowed us to 
specifically trace the appearance, deletion, and expansion 
of these families during the course of eukaryotic evolu- 
tion. Most gene families resistant to expansion (single 
member gene families) are highly conserved and are rep- 
resented in most of the eukaryotic species. We assume 
that these families are present in the common ancestor 
of eukaryotes as they are involved in fundamental pro- 
cesses such as DNA damage repair and intron splicing. 
The remarkable conservation of these gene families with 
respect to sequence, as well as their ability to resist 



expansion, is consistent with previous observations 
[24-27]. Those functioning as structural proteins, patho- 
gen response proteins, stress related proteins, signalling 
proteins, and proteins acting as transcription factors are 
often more prone to lineage specific expansions than are 
proteins that are involved in basic cellular functions like 
DNA modification and RNA metabolism [28]. It is still 
unclear why specific gene families undergo massive ex- 
pansion while some remain unchanged across evolution- 
ary distances. In general, C2H2 ZNF gene families with 
one or two ZNF domains are more resistant to expan- 
sion while multi ZNF domain containing families are 
not. It is assumed that such expansion occurs due to the 
modular structure of the multi-ZNF containing genes 
that provides a favourable platform for developing novel 
functionalities [29]. It has been hypothesized that 
lineage-specific expansions are a principle means of 
adaptation and one of the most important sources of 
organizational and regulatory diversity in many organ- 
isms during transitions towards higher complexity [28]. 

Methods 

Generating HMM profiles 

Previously identified putative orthologs that were 
present in the common ancestor of the bilaterians Homo 
sapiens, Drosophila melanogaster, and Caenorhabditis 
elegans were used as a focus for the present study [2,4]. 
For each of those families described, additional members 
belonging to various bilaterians that had well annotated 
genomes were collected from databases such as NCBI 
refseq [30] and Swissprot [31]. Both, similarity based 
searches as well as key-word based searches were used 
to retrieve the sequences. These families were further 
augmented with additional validated gene members de- 
scribed in the literature [3,4,32-34]. These sequences 
were aligned using the MUSCLE multiple sequence 
aligner program. The Hidden-Markov model (HMM) 
profiles for each of these families were created with the 
hmmbuild option of the HMMER 3.0 [35] package. The 
reference sequences were obtained from diverse taxa in 
order to make the profiles more representative of the ge- 
nomes chosen for study. 

Obtaining eukaryotic protein datasets 

The protein datasets of completely sequenced organisms 
representing all major eukaryotic clades were downloaded 
from NCBI, Ensembl, JGL and Sanger. The downloaded 
genomes were then categorized into various class/phyla 
based on NCBI taxonomy information. Complete lists of 
the included species are given in Additional file 5, and the 
sources for these genomes along with their build numbers 
are provided in Additional file 6. The obtained genomes 
were sorted taxonomically into 4 groups as protists, 
plants, fungi and metazoans. 
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HMM profile search 

Whole predicted proteomes of the various species were 
scanned with all created HMM profiles using the 
hmmsearch option of the HMMER 3.0 package using 
minimum e-value threshold of 0.001. A loop written in 
Bash script was used to complete the reiterative 
hmmsearch procedure and the processing of results. For 
each HMM-genome pair, sequence hits were sorted based 
on the score for the full sequence and then on the best do- 
main score. Only those sequences that had scores greater 
than 100 were chosen to be used in BLAST searches 
(standalone BLAST version 2.2.25. from NCBI) [7]. 

BLAST search 

Standalone BLAST was performed using the chosen se- 
quences against a local sequence database consisting 
only of the well annotated, complete set of genes 
from Homo sapiens, Drosophila melanogaster, and 
Caenorhabditis elegans. Subsequently, no more than 3 
best hits from these focused BLAST results were scanned 
for accession numbers that matched a master list of such 
numbers. This master list was constructed using only 
those genes from the three reference organisms that were 
members of a given HMM profile/family. This pairwise 
process was repeated for each profile and each genome. 
Only the sequences that identified the correct family as 
verified by the master list accession numbers were chosen 
as family members. 

Increasing specificity 

The process was repeated two more times after adding 
the identified members from the previous round to gen- 
erate a new HMM profile for the family. In order to in- 
crease the specificity of ortholog detection, during the 
second round, separate HMM profiles were generated 
for each of four taxonomic clades protists, fungi, plants, 
and metazoans. For those families for which the se- 
quence data was not available for different clades, gen- 
eral profiles were again used in the second round after 
updating the HMM profile with new sequences. 

We carefully re-examined all the species lacking fam- 
ilies that were otherwise present in closely related spe- 
cies of the group/clade. To make sure that these families 
were likely to be missing rather than just difficult to 
identify not because of the poor quality of the proteome 
or poor annotation, we performed a focused blastp/ 
tblastn search of these proteomes/genomes using se- 
quences from closely related species. If no members 
were found after two rounds of HMM profile search and 
focussed BLAST search, the family was declared missing 
from the species. All the sequences identified as 
orthologs in the respective family were then catalogued. 
Those families that had multiple members were then 
analysed to determine whether they were truly 



parologous or just duplicate sequences by aligning them 
using clustalw software [36]. 
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